This document contains dimensionality reduction contents.

Time Complexity of PCA, t-SNE, and UMAP

PCA

PCA (Principal Component Analysis) is a widely used technique for dimensionality reduction. The time complexity of PCA depends on the number of data-points n and the number of features p. In the worst-case scenario (based on the definition), the time complexity of PCA is O(p2n + p3), where n is the number of data-points and p is the number of features.

The computation of the covariance matrix contributes O(p2n) to the overall complexity, while the eigen-value decomposition adds O(p3). In R, using the prcomp function with scaling to implement PCA as SVD (Singular Value Decomposition) of the Dataset matrix results in a time complexity of O(n2p + p3).

For reference and more information: - StackOverflow - Complexity of PCA - Medium - Computational Complexity of PCA - SVD Implementation to Speed up PCA

t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is another dimensionality reduction technique. Its time complexity is O(n2), where n is the number of data-points.

For reference and more information: - t-SNE Paper (arXiv)

UMAP

UMAP (Uniform Manifold Approximation and Projection) is a popular dimensionality reduction method. Its time complexity varies based on the dimension of the target reduced space d and the number of data-points n. In the average case, the complexity is O(d * n1.14), while in the worst-case scenario, it is O(n2).

For reference and more information: - UMAP GitHub Issue - ResearchGate - UMAP Publication

Dimensionality Reduction Code

These are the needed parameters for dim-reduce:

k <- 7
seed <- 1234
results_path <- "../results/dim-reduce/R"

# t-SNE Parameters
tsne_perplexity <- 40
tsne_max_iter <- 1000
tsne_initial_dims <- 50
tsne_seed <- seed

# UMAP Parameters
umap_n_neighbors <- 15
umap_metric <- "euclidean"
umap_min_dist <- 0.1
umap_seed <- seed
color <- "variant"
shape <- "year"
include_plots <- TRUE
### OPTIONAL ###
# filter1_factor <- "variant"
# filter1_values <- c("Omicron", "Omicron Sub")
# filter2_factor <- "year"
# filter2_values <- c("2023")
# GET KMERS FROM PRE-WRITTEN FILES (depends on strat_size)
# LOAD SOURCES #############################################
strat_size <- 100
kmers_data_path <- "../data/kmers"
k_path <- sprintf("%s/kmer_%d_%d.csv", kmers_data_path, k, strat_size)
message(sprintf("Reading %s for later... ", k_path), appendLF = FALSE)
## Reading ../data/kmers/kmer_7_100.csv for later...
kmers <- utils::read.csv(k_path)
message("DONE!")
## DONE!

Dimensionality Reduction Methods

Principal Component Analysis (PCA)

PCA is a popular method for dimensionality reduction that transforms high-dimensional data into a lower-dimensional space. It identifies the principal components, which are linear combinations of the original features that capture the most variance in the data. In this code, PCA is used to reduce the dimensionality of the scaled data (x). The results are plotted in 2D and 3D PCA plots, screeplot, factor loadings plot, and many other more.

pre_reduce_res <- pre_reduce(results_path, kmers, k, filter1_factor, 
                               filter1_values, filter2_factor, filter2_values)
## [1] "Pre-processing and scaling data..."
df <- pre_reduce_res$df                # df is the original dataset
x <- pre_reduce_res$x                  # x is the scaled data

# Perform PCA
pca_df <- pca_fn(x)
## [1] "Performing PCA..."

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a nonlinear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in a lower-dimensional space while preserving local structures. In this code, t-SNE is performed using the tsne or Rtsne library (based on user selection) on the PCA results. The function generates 2D and 3D t-SNE plots, which represent the data points in a way that maintains the similarity between nearby points.

# Perform t-SNE via 'tsne' library using PCA results (in 3 dimensions)
# # Note: Uncomment the next two line to use tsne; otherwise, comment them
tsne_df <- tsne_fn(pca_df$x, 3, tsne_initial_dims, tsne_perplexity,
                   tsne_max_iter, tsne_seed)

# Perform t-SNE via 'Rtsne' library using PCA results (in 3 dimensions)
# # Note: Uncomment the next two lines to use Rtsne; otherwise, comment them
# is_tsne <- FALSE
# tsne_df <- rtsne_fn(pca_df$x, 3, tsne_perplexity, tsne_max_iter, tsne_seed)

Uniform Manifold Approximation and Projection (UMAP)

UMAP is another nonlinear dimensionality reduction method that seeks to preserve both local and global structures of the data. It is known for its ability to scale to large datasets. In this code, UMAP is used to project the scaled data (x) into a 3D space. The function generates 2D and 3D UMAP plots, which provide a reduced representation of the data while preserving the relationships between data points.

# Perform UMAP (in 3 dimensions)
umap_df <- umap_fn(x, 3, umap_n_neighbors, umap_metric, 
                   umap_min_dist, umap_seed)

2D PCA plot:

  • This plot represents the data points projected onto the first two principal components obtained from PCA.
  • It helps visualize the overall distribution of the data in a 2D space while retaining the most significant variability in the data.
  • The plot may reveal clusters or patterns in the data.
# Generate 2D PCA plot
pcolor <- df[[color]]
pshape <- df[[shape]]

p <- autoplot(pca_df, data = df) +
  geom_point(aes(color = pcolor, shape = pshape, text = paste(
    "Identifier: ", df$gisaid_epi_isl, "\n",
    "Variant: ", df$variant, "\n",
    "Sex: ", df$sex, "\n",
    "Division Exposure: ", df$division_exposure, "\n",
    "Year: ", format(as.Date(df$date), "%Y"), "\n",
    "Strain: ", df$strain, "\n",
    "Pangolin Lineage: ", df$pangolin_lineage
  ))) +
  scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(color = pcolor, shape = pshape, text =
## paste("Identifier: ", : Ignoring unknown aesthetics: text
p <- ggplotly(p)
p

3D PCA plot (pca_3d):

  • Similar to the 2D PCA plot, but in a 3D space.
  • It provides additional depth to visualize the data points and explore the variation in the first three principal components.
  • It allows for the identification of clusters or patterns that might not be apparent in the 2D projection.
# Generate 3D PCA plot
p <- plot_ly(as.data.frame(pca_df$x[, 1:3]),
             x = ~PC1, y = ~PC2, z = ~PC3, type = "scatter3d",
             mode = "markers", color = df[[color]], symbol = df[[shape]],
             text = paste(
               "Identifier: ", df$gisaid_epi_isl, "\n",
               "Variant: ", df$variant, "\n",
               "Sex: ", df$sex, "\n",
               "Division Exposure: ", df$division_exposure, "\n",
               "Year: ", format(as.Date(df$date), "%Y"), "\n",
               "Strain: ", df$strain, "\n",
               "Pangolin Lineage: ", df$pangolin_lineage)
             )

p <- ggplotly(p)
p

Screeplot:

  • A screeplot is a line plot that shows the variance explained by each principal component.
  • It helps identify the number of principal components that capture most of the variance in the data.
  • The ā€œelbow pointā€ in the plot is often used to determine the appropriate number of principal components to retain.
# Generate screeplot
p <- fviz_eig(pca_df,
              xlab = "Number of Principal Components")

p <- ggplotly(p)
p
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## '_deprecated', 'alignmentgroup', 'base', 'basesrc', 'cliponaxis', 'constraintext', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'insidetextanchor', 'insidetextfont', 'legendgroup', 'legendgrouptitle', 'legendrank', 'marker', 'meta', 'metasrc', 'name', 'offset', 'offsetgroup', 'offsetsrc', 'opacity', 'orientation', 'outsidetextfont', 'selected', 'selectedpoints', 'showlegend', 'stream', 'text', 'textangle', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'width', 'widthsrc', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

Factor Loadings plot (factor_loadings):

  • This plot shows the loadings of variables on the first three principal components.
  • It helps in understanding which variables contribute most to the principal components.
  • Variables with high absolute loadings contribute significantly to the components’ variation.
# Generate factor loadings plot of first 3 principal components
n_components <- 3

# Extract factor loadings
loadings <- pca_df$rotation

# Plot bar plots for factor loadings of n principal components
for (i in 1:n_components) {
  # Create a data frame for the factor loadings
  loadings_df <- data.frame(
    variable = colnames(x),
    loading = loadings[, i]
  )
  
  # Create a bar plot using ggplot2
  p <- ggplot(loadings_df, aes(x = variable, y = loading)) +
    geom_bar(stat = "identity", fill = "blue") +
    labs(
      title = paste("Principal Component", i),
      x = "Variables", y = "Factor Loadings"
    )
  p <- ggplotly(p)
  p
}

Graph of Individuals:

  • Also known as a ā€œscore plotā€ or ā€œindividual plot,ā€ it displays the data points in the PCA space based on their scores on the principal components.
  • It helps to visualize how individual samples are distributed in the reduced PCA space.
  • Clusters of points close together may indicate similar samples in terms of the measured variables.
# Generate graph of individuals
p <- fviz_pca_ind(pca_df,
                  col.ind = "cos2", # Color by the quality of representation
                  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                  repel = TRUE, # Avoid text overlapping
                  xlab = "PC1",
                  ylab = "PC2")

p <- ggplotly(p)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues
p

Graph of Variables (vars):

  • This plot shows the projection of the original variables onto the first two principal components.
  • It helps to understand how variables are related to each other and how they contribute to the PCA components.
  • Variables close to each other in the plot are positively correlated, while variables far apart are negatively correlated.
# Generate graph of variables
p <- fviz_pca_var(pca_df,
                  col.var = "contrib", # Color by contributions to the PC
                  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                  repel = TRUE, # Avoid text overlapping
                  xlab = "PC1",
                  ylab = "PC2")

p <- ggplotly(p)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues
p

Biplot:

  • A biplot combines the individual and variable plots into a single plot.
  • It helps to visualize both the data points and the variables in the same reduced-dimensional space.
  • This enables understanding the relationships between variables and individuals.
# Generate biplot
p <- fviz_pca_biplot(pca_df,
                     col.var = "#2E9FDF", # Variables color
                     col.ind = "#696969", # Individuals color
                     addEllipses = TRUE,
                     xlab = "PC1",
                     ylab = "PC2")

p <- ggplotly(p)
p

2D t-SNE plot (tsne_2d):

  • This plot represents the data points in a 2D space, using the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm.
  • t-SNE is useful for visualizing high-dimensional data and preserving local structures.
  • Clusters of points that appear close together in this plot are likely to be similar in the original high-dimensional space.
# Generate 2D t-SNE plot
tsne_df_2d <- data.frame(X1 = tsne_df[, 1], X2 = tsne_df[, 2],
                      color = df[[color]], shape = df[[shape]])
p <- ggplot(tsne_df_2d, aes(x = X1, y = X2, color = color, shape = shape)) +
  geom_point(aes(text = paste(
    "Identifier: ", df$gisaid_epi_isl, "\n",
    "Variant: ", df$variant, "\n",
    "Sex: ", df$sex, "\n",
    "Division Exposure: ", df$division_exposure, "\n",
    "Year: ", format(as.Date(df$date), "%Y"), "\n",
    "Strain: ", df$strain, "\n",
    "Pangolin Lineage: ", df$pangolin_lineage
  ))) +
  xlab("TSNE-2D-1") +
  ylab("TSNE-2D-2") +
  scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(text = paste("Identifier: ", df$gisaid_epi_isl, :
## Ignoring unknown aesthetics: text
p <- ggplotly(p)
p

3D t-SNE plot

  • Similar to the 2D t-SNE plot, but in a 3D space, providing additional perspectives on the data distribution.
  • It can help identify more complex and intricate structures within the data.
tsne_df_3d <- data.frame(X1 = tsne_df[, 1], X2 = tsne_df[, 2], X3 = tsne_df[, 3],
                      color = df[[color]], shape = df[[shape]])
final <- cbind(data.frame(tsne_df_3d), df[[color]], df[[shape]])
p <- plot_ly(final,
             x = ~X1, y = ~X2, z = ~X3, type = "scatter3d", mode = "markers",
             color = df[[color]], symbol = ~shape,
             text = paste(
               "Identifier: ", df$gisaid_epi_isl, "<br>",
               "Variant: ", df$variant, "<br>",
               "Sex: ", df$sex, "<br>",
               "Division Exposure: ", df$division_exposure, "<br>",
               "Year: ", format(as.Date(df$date), "%Y"), "<br>",
               "Strain: ", df$strain, "<br>",
               "Pangolin Lineage: ", df$pangolin_lineage
             )
)

p <- ggplotly(p)
p

2D UMAP plot

  • This plot represents the data points in a 2D space using the Uniform Manifold Approximation and Projection (UMAP) algorithm.
  • UMAP is another dimensionality reduction technique similar to t-SNE, but often computationally more efficient.
  • It can reveal different structures or patterns in the data compared to t-SNE.
# Generate 2D UMAP plot
emb <- umap_df$layout
  
x_o <- emb[, 1]
y_o <- emb[, 2]

p <- ggplot(
  data = as.data.frame(emb),
  aes(x = x_o, y = y_o, color = df[[color]], shape = df[[shape]])
) +
  geom_point(aes(text = paste(
    "Identifier: ", df$gisaid_epi_isl, "\n",
    "Variant: ", df$variant, "\n",
    "Sex: ", df$sex, "\n",
    "Division Exposure: ", df$division_exposure, "\n",
    "Year: ", format(as.Date(df$date), "%Y"), "\n",
    "Strain: ", df$strain, "\n",
    "Pangolin Lineage: ", df$pangolin_lineage
  ))) +
  xlab("UMAP_1") +
  ylab("UMAP_2") +
  labs(shape = "shape", color="color") +
  scale_color_brewer(palette = "Set1")
## Warning in geom_point(aes(text = paste("Identifier: ", df$gisaid_epi_isl, :
## Ignoring unknown aesthetics: text
p <- ggplotly(p)
p

3D UMAP plot (umap_3d):

  • Similar to the 2D UMAP plot, but in a 3D space, providing additional perspectives on the data distribution.
  • It allows for exploring the data points’ distribution and potential clusters in a more complex space.
# Generate 3D UMAP plot
final <- cbind(data.frame(umap_df[["layout"]]), color, shape)
p <- plot_ly(final,
             x = ~X1, y = ~X2, z = ~X3, type = "scatter3d", mode = "markers",
             color = df[[color]], symbol = ~shape,
             text = paste(
               "Identifier: ", df$gisaid_epi_isl, "<br>",
               "Variant: ", df$variant, "<br>",
               "Sex: ", df$sex, "<br>",
               "Division Exposure: ", df$division_exposure, "<br>",
               "Year: ", format(as.Date(df$date), "%Y"), "<br>",
               "Strain: ", df$strain, "<br>",
               "Pangolin Lineage: ", df$pangolin_lineage
             )
)

p <- p %>% add_markers()
p <- p %>% layout(scene = list(
  xaxis = list(title = "0"),
  yaxis = list(title = "1"),
  zaxis = list(title = "2")
))

p <- ggplotly(p)
p
## R Session Info:
## ```r
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Philippines.utf8  LC_CTYPE=English_Philippines.utf8   
## [3] LC_MONETARY=English_Philippines.utf8 LC_NUMERIC=C                        
## [5] LC_TIME=English_Philippines.utf8    
## 
## time zone: Asia/Singapore
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] glue_1.6.2            highcharter_0.9.4     data.table_1.14.8    
##  [4] microbenchmark_1.4.10 colorspace_2.1-0      cluster_2.1.4        
##  [7] dendextend_1.17.1     ggdendro_0.1.23       devtools_2.4.5       
## [10] usethis_2.2.2         ggfortify_0.4.16      RColorBrewer_1.1-3   
## [13] tsne_0.1-3.1          Rtsne_0.16            scales_1.2.1         
## [16] factoextra_1.0.7.999  htmlwidgets_1.6.2     umap_0.2.10.0        
## [19] seqinr_4.2-30         gsubfn_0.7            proto_1.0.0          
## [22] validate_1.1.3        kmer_1.1.2            ape_5.7-1            
## [25] forcats_1.0.0         purrr_1.0.1           readr_2.1.4          
## [28] tibble_3.2.1          tidyverse_2.0.0       tidyr_1.3.0          
## [31] stringr_1.5.0         shiny_1.7.4.1         rmarkdown_2.23       
## [34] markdown_1.7          rio_0.5.29            psych_2.3.6          
## [37] plotly_4.10.2         lubridate_1.9.2       httr_1.4.6           
## [40] ggvis_0.4.8           ggthemes_4.2.4        GGally_2.1.2         
## [43] ggplot2_3.4.2         dplyr_1.1.2           plyr_1.8.8           
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.15.0 jsonlite_1.8.5    magrittr_2.0.3    farver_2.1.1     
##   [5] fs_1.6.2          vctrs_0.6.3       memoise_2.0.1     askpass_1.1      
##   [9] rstatix_0.7.2     htmltools_0.5.5   curl_5.0.1        haven_2.5.2      
##  [13] broom_1.0.5       cellranger_1.1.0  TTR_0.24.3        sass_0.4.6       
##  [17] bslib_0.5.0       zoo_1.8-12        cachem_1.0.8      igraph_1.5.0.1   
##  [21] mime_0.12         lifecycle_1.0.3   pkgconfig_2.0.3   Matrix_1.5-4.1   
##  [25] R6_2.5.1          fastmap_1.1.1     digest_0.6.32     reshape_0.8.9    
##  [29] ps_1.7.5          RSpectra_0.16-1   pkgload_1.3.2.1   crosstalk_1.2.0  
##  [33] ggpubr_0.6.0      labeling_0.4.2    fansi_1.0.4       timechange_0.2.0 
##  [37] abind_1.4-5       compiler_4.3.1    remotes_2.4.2     withr_2.5.0      
##  [41] backports_1.4.1   carData_3.0-5     viridis_0.6.3     pkgbuild_1.4.2   
##  [45] ggsignif_0.6.4    MASS_7.3-60       openssl_2.0.6     sessioninfo_1.2.2
##  [49] tools_4.3.1       foreign_0.8-84    quantmod_0.4.24   zip_2.3.0        
##  [53] httpuv_1.6.11     callr_3.7.3       nlme_3.1-162      promises_1.2.0.1 
##  [57] grid_4.3.1        ade4_1.7-22       generics_0.1.3    gtable_0.3.3     
##  [61] tzdb_0.4.0        hms_1.1.3         car_3.1-2         utf8_1.2.3       
##  [65] ggrepel_0.9.3     pillar_1.9.0      later_1.3.1       lattice_0.21-8   
##  [69] tidyselect_1.2.0  phylogram_2.1.0   miniUI_0.1.1.1    knitr_1.43       
##  [73] gridExtra_2.3     xfun_0.39         stringi_1.7.12    lazyeval_0.2.2   
##  [77] yaml_2.3.7        pacman_0.5.1      evaluate_0.21     tcltk_4.3.1      
##  [81] settings_0.2.7    cli_3.6.1         xtable_1.8-4      reticulate_1.30  
##  [85] munsell_0.5.0     processx_3.8.1    jquerylib_0.1.4   Rcpp_1.0.10      
##  [89] readxl_1.4.2      png_0.1-8         parallel_4.3.1    ellipsis_0.3.2   
##  [93] assertthat_0.2.1  prettyunits_1.1.1 profvis_0.3.8     urlchecker_1.0.1 
##  [97] viridisLite_0.4.2 rlist_0.4.6.2     xts_0.13.1        openxlsx_4.2.5.2 
## [101] crayon_1.5.2      rlang_1.1.1       mnormt_2.1.1
## 
## ```
## Session information saved to: R_session_info.txt